ABSTRACT

High-throughput DNA sequencing technology was
utilized to describe the protein coding regions of
genomic DNA (the transcriptome) for both Western
Fence Lizard (Sceloporus occidentalis, WFL) and
Japanese Quail (Coturnix coturnix, JQ). A total of
928,759 and 559,819 transcriptomic sequences for WFL
and JQ, respectively, were clustered and assembled.
Assembled unigenes with lengths >200 base pairs were
annotated using the Basic Local Alignment Search Tool
(BLAST) against 5 publicly available protein sequence
databases on the DoD supercomputers Diamond (SGI Altix
ICE) and Jade (Cray XT4). A total of 58,962 and 44,455
unigenes were identified for WFL and JQ, respectively.
Annotation of unigenes via similarity search against
known proteins in the NCBI NR.aa and Refseq, and
EMBL-EBI UniProt-SwissProt, Uniref90, and Uniref100
protein databases characterized 44 and 33 % of unigenes
for WFL and JQ, respectively.
Sequences  with  significant  similarity  to  known  proteins 
were  used  to  design  custom  ultra-high  density  gene 
expression  microarrays  which  are  being  used  to  develop 
innovative  methods  to  pro-actively  assess  the  impacts  of 
Army  activity  on  environmental  quality  on  installations. 
Further,  this  effort  has  developed  a  cyber-infrastructure 
capability  with  web-based  tools  and  data  visualization 
capability  for  the  ERDC  Environmental  Laboratory  to 
rapidly  develop  genomic  infrastructure  and  gene 
expression  tools  for  any  ecological  receptors  that  become 
species  of  concern. 


1.  INTRODUCTION 

The  utility  of  genomic  tools  has  been  broadly 
documented  for  the  advancement  of  the  biological 
sciences.  Although  genomic  tool  development  for 
ecologically-relevant  non-model  species  has  lagged 
relative  to  model  species,  advancements  in  sequencing 
technology,  bioinformatics  processing,  and  gene 
expression  platforms  have  led  to  an  increasing  number  of 
non-model  species  having  deep-coverage  and  well- 
annotated  transcriptomes  from  which  high-quality 
genomic  tools  have  been  produced  [Rawat  et  al  2010]. 
We have developed a bioinformatics infrastructure and
data processing pipelines to transition raw sequence data
to robustly annotated coding genes to support gene
expression profiling and biological impact assessment of
Army stressors on ecological receptors
(http://www.ifxworks.com/EnvironmentalSystemsBiology.html).
These tools are being used to assess gene expression
signatures in response to environmental perturbations in
ecological model and environmental sentinel species such
as Northern Bobwhite (Colinus virginianus), Fathead
Minnow (Pimephales promelas), Earthworm (Eisenia fetida),
Western Fence Lizard, Japanese Quail, and Staghorn Coral
(Acropora formosa) [Garcia-Reyero et al 2009, Gong et al
2008, Gong et al 2007, Gust et al 2009, Rawat et al 2010,
http://ieff.ifxworks.com/EGGT/]. These gene expression
and cyber-infrastructure tools are proving to be
indispensable as the focus of biological research and
regulatory decision frameworks continues to shift toward
systems biology and predictive toxicology approaches.

2. MATERIALS AND METHODS

Tissue samples used to construct the normalized
cDNA libraries for WFL and JQ were collected from five
control animals of each sex for each species. The RNA
pool used to construct the cDNA library for WFL
included RNA extracted from brain, bone marrow, gut,
heart, liver, ovary and testes tissues. The library for JQ
included RNA extracted from the adrenal gland, brain,
bursa, duodenum, heart, kidney, liver, lung, ovary,
pituitary gland, spleen, testes and thyroid. All protocols
were conducted consistent with Good Laboratory
Practices and approved by the Institutional Animal Care
and Use Committee at the U.S. Army Center for Health
Promotion and Preventive Medicine.

2.1  Tissue  Fixation  and  RNA  Extraction 

Immediately following euthanasia by CO2
asphyxia, tissue samples were fixed in RNAlater
(Ambion, Austin, TX) following the manufacturer's
recommendations. RNA extraction was conducted using
RNeasy Mini RNA extraction kits (Qiagen Inc.,
Valencia, CA). RNA quality was assessed using an
Agilent 2100 Bioanalyzer (Agilent Technologies,
Waldbronn, Germany) with RNA 6000 Nano LabChips®.
Only samples with a 28S/18S ratio >2.0 and RNA
integrity number (RIN) >7.0 were used for downstream
applications. The RNA pool for each species included
500 ng of total RNA from each of the 46 and 44 total
RNA samples collected for WFL and JQ, respectively.
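
The acceptance criteria and pooling scheme described above can be sketched in a few lines of Python. This is only an illustration: the thresholds and the 500 ng contribution come from the text, while the `RnaSample` class, the sample identifiers, and the example values are hypothetical.

```python
from dataclasses import dataclass

# Thresholds and per-sample contribution as stated in the text.
MIN_RRNA_RATIO = 2.0   # 28S/18S ratio must exceed this value
MIN_RIN = 7.0          # RNA integrity number must exceed this value
NG_PER_SAMPLE = 500    # ng of total RNA contributed by each sample


@dataclass
class RnaSample:
    sample_id: str      # hypothetical identifier
    rrna_ratio: float   # 28S/18S ratio reported by the Bioanalyzer
    rin: float          # RNA integrity number


def passes_qc(sample: RnaSample) -> bool:
    """Return True only if both quality criteria are met."""
    return sample.rrna_ratio > MIN_RRNA_RATIO and sample.rin > MIN_RIN


def pooled_mass_ng(n_samples: int) -> int:
    """Total RNA mass in a pool built from equal contributions."""
    return n_samples * NG_PER_SAMPLE


# Hypothetical example values, not measurements from the study.
samples = [
    RnaSample("liver_01", 2.1, 8.4),
    RnaSample("brain_02", 1.8, 9.0),   # fails the 28S/18S criterion
    RnaSample("heart_03", 2.3, 6.5),   # fails the RIN criterion
]
kept = [s.sample_id for s in samples if passes_qc(s)]
print(kept)
print(pooled_mass_ng(46), pooled_mass_ng(44))  # WFL and JQ pools, in ng
```

With 46 and 44 samples at 500 ng each, the pools amount to 23 and 22 micrograms of total RNA for WFL and JQ.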

2.3  cDNA  Library  Construction  and  Normalization 

The SMART™ PCR cDNA Synthesis Kit
(Clontech Laboratories Inc., Mountain View, CA) was
utilized to reverse-transcribe 1.0 µg of the total RNA
sample into full-length cDNAs for each of WFL and JQ.
The cDNA libraries were normalized prior to sequencing
to capture both high and low abundance transcripts using
the Trimmer cDNA Normalization Kit (Evrogen JSC,
Moscow, Russia).

2.4  cDNA  Sequencing 

The normalized cDNA libraries for WFL and JQ
were individually sequenced using massively-parallel
pyrosequencing on a GS-FLX sequencer using a protocol
to resolve 400 bp reads. Briefly, cDNAs were nebulized
and size-selected for 500 to 800 base pair fragments.
Two primer sequences, Adaptor A and Adaptor B, were
ligated to the fragments. cDNAs containing both an A
and a B adaptor were melted into single stranded DNA,
immobilized onto DNA capture beads and emulsified in
oil for polymerase chain reaction (emPCR). The PCR
emulsion was titrated to determine the optimal amount of
ssDNA needed to create a 1:1 DNA fragment to bead
ratio. emPCR was performed and the amplified library
was loaded onto a 70 x 75 mm PicoTiterPlate and
sequenced. A full PicoTiterPlate was used for each
library preparation.

2.5  DNA  Sequence  Processing  and  Annotation 

Genome-scale transcriptomes of WFL and JQ
were used for EST-based clustering and assembly via
The Gene Indices Clustering Tools (TGICL) (Pertea et
al. 2003), which uses megablast (Altschul et al. 1990) for
homology-based clustering and CAP3 (Huang and
Madan 1999) for assembly. Unigenes consisting of
contiguous sequences (contigs) and singlets longer than
200 base pairs were selected for blastx (Altschul et al.
1990) homology-based coding potential detection and
annotation against 5 publicly available protein
sequence knowledge sets: NCBI NR.aa (10,606,545
proteins) and Refseq (6,392,535 proteins), and EMBL-EBI
(http://www.ebi.ac.uk/) UniProt-SwissProt (515,203
proteins), Uniref90 (6,544,144 proteins), and Uniref100
(9,865,668 proteins). The CPU-intensive computational
biology analysis pipelines of clustering, assembly, and
annotation were run via Portable Batch System (PBS)
(http://en.wikipedia.org/wiki/Portable_Batch_System)
on the DoD supercomputers Diamond, an SGI Altix
ICE with 1920 nodes and 15,360 cores, and Jade, a Cray
XT4 system with 2146 nodes and 8,584 cores
(http://www.erdc.hpc.mil/hardSoft/Hardware/home).
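
The length criterion applied before annotation (unigenes longer than 200 base pairs) can be illustrated with a minimal FASTA filter. This is a sketch under stated assumptions, not the pipeline used in this work: the function names and the example records are hypothetical, and the input is assumed to be plain FASTA text.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_LENGTH = 200  # unigenes must be longer than 200 bp, as in the text


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split(maxsplit=1)
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_unigenes(lines: Iterable[str],
                    min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep only sequences strictly longer than min_length."""
    return {n: s for n, s in read_fasta(lines) if len(s) > min_length}


# Hypothetical records: one 240 bp contig and one 200 bp singlet.
example = [
    ">contig_1 hypothetical",
    "ACGT" * 30,
    "ACGT" * 30,
    ">singlet_7",
    "ACGT" * 50,
]
selected = select_unigenes(example)
print(sorted(selected))
```

The 200 bp singlet is excluded because the criterion is strictly "longer than" 200 base pairs; the retained sequences would then be passed to the blastx annotation step.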

We have implemented a mature bioinformatics and
computational biology system which includes: (1) a
Relational Database Management System (RDBMS),
Oracle (Oracle, Redwood Shores, CA), for quick data
retrieval and integration; (2) public and private data and
results access via network shared file servers
(http://ieff.ifxworks.com/EGGT/Quail_Lizard.html); (3)
data and result visualization, data retrieval, and data
mining via a publicly accessible web server
(http://www.ifxworks.com/EnvironmentalSystemsBiology.html);
and (4) high performance and throughput
computational analysis pipelines for quick data loading,
retrieval, analysis, processing, integration, and validation
(Figures 1 and 2).
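
To make the role of the relational database concrete, the following sketch stores unigenes and their annotation hits in two tables and counts annotated unigenes per species and protein database. It is purely illustrative: the schema, table names, and rows are hypothetical, and Python's built-in sqlite3 module stands in for the Oracle RDBMS named above.

```python
import sqlite3

# In-memory database; sqlite3 is a stand-in for the Oracle RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unigene (
    unigene_id TEXT PRIMARY KEY,
    species    TEXT NOT NULL,
    kind       TEXT NOT NULL CHECK (kind IN ('contig', 'singlet')),
    length_bp  INTEGER NOT NULL
);
CREATE TABLE annotation (
    unigene_id TEXT NOT NULL REFERENCES unigene(unigene_id),
    protein_db TEXT NOT NULL,
    hit_id     TEXT NOT NULL,
    evalue     REAL NOT NULL
);
""")

# Hypothetical rows for illustration only.
conn.executemany("INSERT INTO unigene VALUES (?, ?, ?, ?)", [
    ("WFL_contig_1", "WFL", "contig", 812),
    ("WFL_singlet_1", "WFL", "singlet", 356),
    ("JQ_contig_1", "JQ", "contig", 640),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", [
    ("WFL_contig_1", "NR.aa", "hit_A", 1e-30),
    ("WFL_contig_1", "Refseq", "hit_B", 1e-28),
    ("JQ_contig_1", "NR.aa", "hit_C", 1e-12),
])

# Count annotated unigenes per species and protein database.
rows = conn.execute("""
    SELECT u.species, a.protein_db, COUNT(DISTINCT u.unigene_id)
    FROM unigene AS u
    JOIN annotation AS a ON a.unigene_id = u.unigene_id
    GROUP BY u.species, a.protein_db
    ORDER BY u.species, a.protein_db
""").fetchall()
print(rows)
```

Keeping annotation hits in a separate table lets a unigene carry hits from several protein databases without duplicating its sequence metadata.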

Apache web server (http://www.apache.org/), HTML
(http://en.wikipedia.org/wiki/HTML), JavaScript
(http://en.wikipedia.org/wiki/JavaScript), and Perl CGI
(http://perldoc.perl.org/CGI.html) were used for web
display of data and for development of web-based tools
for blast and retrieval of unigene and transcriptome/EST
DNA sequences for data mining and microarray chip
design (Figures 1 and 2).


3.  RESULTS 

The sequencing effort produced over 328 million
bases for WFL and 189 million bases for JQ in
928,780 and 559,833 sequence reads, respectively
(Table 1). Average sequence read length was 354 bases
for WFL and 338 bases for JQ. The sequence data sets
were used to drive the development of bioinformatics and
computational biology analysis pipelines to cluster,
assemble and annotate protein-coding sequence data and
select functional probes for microarray design.
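
The average read lengths follow from the totals reported in Table 1, and the number of control sequences removed before assembly follows from the EST counts. The short check below recomputes both; the dictionary layout and variable names are illustrative, while the figures are those reported in this paper.

```python
# Read and base totals from Table 1; EST counts from the assembly input.
runs = {
    "WFL": {"reads": 928_780, "bases": 328_540_934, "ests": 928_759},
    "JQ": {"reads": 559_833, "bases": 189_239_672, "ests": 559_819},
}

# Mean read length is total bases divided by the number of reads.
mean_length = {sp: r["bases"] / r["reads"] for sp, r in runs.items()}

# Control sequences are the reads not carried forward as ESTs.
controls_removed = {sp: r["reads"] - r["ests"] for sp, r in runs.items()}

for sp in runs:
    print(sp, round(mean_length[sp]), controls_removed[sp])
```

Rounded to whole bases, the computed means agree with the "Length Average" row of Table 1.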


[Figure 1: architecture diagram with panels labeled "High
Performance and Throughput Computing using Super
Computers", "Batch Processing: (1) Data Uploading; (2) Data
Validation; (3) Data Analysis; (4) Data Processing",
"Private File Server", and "Public File Server".]

Figure 1. Bioinformatics system architecture used to
implement data management, data security and web
accessibility of publicly shared data and computational
tools.

[Figure 2: web form titled "Homology Search with Quail and
Lizard ESTs and Unigenes DNA Sequences", with controls for
background color, Blast program (blastn), Cut off, dataset
selection, and Sequence ID, and "Run Blast" and "Reset"
buttons.]

Figure 2. Sequence blast and retrieval web-based tools
for ESTs / transcriptomes and unigenes of Western fence
lizard and Japanese quail.

Table 1. Results of GS-FLX Pyrosequencing of
normalized cDNA Libraries for Western fence lizard
(WFL) and Japanese quail (JQ).

Sequencing Parameters     WFL            JQ
Raw Wells                 2,125,263      1,157,019
Key Pass Wells            2,061,220      1,103,565
Passed Filter Wells       928,780        559,833
Total Bases               328,540,934    189,239,672
Length Average            354            338
Median Reads Length       397            388
Longest Reads Length      2,043          686
Shortest Reads Length     2              11

After removal of the control sequences for the
sequencing runs, 928,759 of 928,780 total DNA sequences
for WFL and 559,819 of 559,833 for JQ were selected as
EST sequences for clustering and assembly. In all,
53,897 contigs and 5,065 singlets totaling 58,962
unigenes were identified for WFL, whereas 41,066 contigs
and 3,389 singlets totaling 44,455 unigenes were
identified for JQ (Table 2). Among the unigenes, 33 to
44 % of singlets and contigs were annotated for
protein-coding potential via homology-based annotation
against the NCBI NR.aa and Refseq, and EMBL-EBI
UniProt-SwissProt, Uniref90, and Uniref100 protein
sequence reference knowledgebases (Table 3).

The annotatable unigenes have been utilized to
design custom ultra-high density microarrays via Agilent
transcriptomic profiling technology. To facilitate access
to the WFL and JQ datasets, web-based tools for DNA
sequence manipulation, such as blast and sequence
retrieval (Figure 2), have been developed.


Table 2. Summary of sequence clustering and assembly
for Western fence lizard (WFL) and Japanese quail (JQ).

Sequence Assembly          WFL        JQ
Total ESTs Available       928,759    559,819
Total Assembled Contigs    53,897     41,066
Total Singlets             5,065      3,389
Total Unigenes             58,962     44,455
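
The unigene totals in Table 2 are the sum of contigs and singlets, and dividing the EST count by the unigene count gives a rough sense of how strongly clustering and assembly condensed the reads. The sketch below checks the first relationship and computes the second; the helper name is hypothetical and the ratio is an illustrative summary, not a statistic reported in this paper.

```python
# Values from Table 2.
assembly = {
    "WFL": {"ests": 928_759, "contigs": 53_897,
            "singlets": 5_065, "unigenes": 58_962},
    "JQ": {"ests": 559_819, "contigs": 41_066,
           "singlets": 3_389, "unigenes": 44_455},
}


def ests_per_unigene(entry: dict) -> float:
    """Average number of input ESTs per resulting unigene."""
    return entry["ests"] / entry["unigenes"]


for species, entry in assembly.items():
    # Unigenes are defined as contigs plus singlets.
    consistent = entry["contigs"] + entry["singlets"] == entry["unigenes"]
    ratio = ests_per_unigene(entry)
    print(f"{species}: totals consistent={consistent}, "
          f"{ratio:.1f} ESTs per unigene")
```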



Table 3. Unigene homology-based coding potential
detection and annotation against the following protein
databases: NR.aa (10,606,545 proteins), Refseq
(6,392,535 proteins), UniProt-SwissProt (515,203
proteins), Uniref90 (6,544,144 proteins), Uniref100
(9,865,668 proteins). WFL and JQ represent Western
fence lizard and Japanese quail, respectively.

Unigene Dataset   Coding Detected   Non-Coding Detected   % Coding   Protein Database
WFL Contigs       23,385            30,512                43.39%     NR.aa
                  23,173            30,724                43.00%     Refseq
                  21,593            32,304                40.06%     UniProt-SwissProt
                  23,463            30,434                43.53%     Uniref100
                  23,508            30,389                43.62%     Uniref90
WFL Singlets      1,425             1,825                 44.33%     NR.aa
                  1,440             1,837                 43.94%     Refseq
                  1,457             1,820                 44.46%     UniProt-SwissProt
                  1,465             1,812                 44.71%     Uniref100
                  1,298             1,979                 39.61%     Uniref90
JQ Contigs        17,873            23,193                43.52%     NR.aa
                  17,732            23,334                43.18%     Refseq
                  15,513            25,553                37.78%     UniProt-SwissProt
                  18,034            23,032                43.92%     Uniref100
                  18,031            23,035                43.91%     Uniref90
JQ Singlets       1,208             2,181                 35.65%     NR.aa
                  1,195             2,194                 35.26%     Refseq
                  1,140             2,249                 33.64%     UniProt-SwissProt
                  1,217             2,172                 35.91%     Uniref100
                  1,211             2,178                 35.73%     Uniref90
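
The "% Coding" column of Table 3 is the number of unigenes with a detected coding hit divided by the total number searched (coding plus non-coding). A minimal sketch of that calculation for two rows of the table follows; the function name is hypothetical.

```python
def percent_coding(coding: int, non_coding: int) -> float:
    """Share of unigenes with a detected protein-coding hit, in percent."""
    return 100.0 * coding / (coding + non_coding)


# Two rows taken from Table 3.
wfl_contigs_nr = percent_coding(23_385, 30_512)        # WFL contigs, NR.aa
jq_contigs_swissprot = percent_coding(15_513, 25_553)  # JQ contigs, SwissProt

print(f"{wfl_contigs_nr:.2f}%  {jq_contigs_swissprot:.2f}%")
```

For both rows the coding and non-coding counts sum to the contig totals in Table 2 (53,897 for WFL and 41,066 for JQ).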

These web-based tools, available for both the raw and
assembled transcriptomes of each species, support
data mining and microarray chip design
(http://ieff.ifxworks.com/EGGT/Analysis_Tools.html).

CONCLUSIONS 

A computational biology analysis pipeline and web-
based bioinformatics data visualization toolset for use in
data retrieval, data mining, and genomics data processing
have been established at the ERDC Environmental
Laboratory for genome-scale transcriptome clustering,
assembly, and annotation for the benefit of
environmental sentinel species. The annotated unigenes
and designed ultra-high density microarrays will be used
in support of assuring environmental quality on
installations. The development of this infrastructure has
enabled a broad capability to develop highly robust
monitoring tools to detect and assess the impacts of
environmental stressors on Army ranges, and to provide
directed management for any target species found to be
of concern on Army ranges. Specific examples of how
tools developed for WFL and JQ are being used to
support R&D for sustainment of the Army mission
include: (1) determination of corrected inter-species
uncertainty factors for avian species that will improve
the accuracy (and likely adjust highly conservative
assumptions) of ecological risk assessment; and (2)
determination of the influence of habitat degradation and
global climate change on the toxicity and impacts of
munitions compounds on the reptile model, WFL.
Overall, this work has provided the infrastructure and
tools to ensure population sustainability on Army ranges,
which in turn supports sustainability of the Army
mission through enhanced and assured range access.